Abstract: Containment-based trees encompass various handy
structures such as B+-trees, R-trees and M-trees. They are
widely used to build data indexes, range-queryable overlays,
and publish/subscribe systems, in both centralized and distributed
contexts. In addition to their versatility, their balanced shape
ensures an overall satisfactory performance. Recently, it has
been shown that their distributed implementations can be
fault-resilient. However, this robustness is achieved at the cost of
unbalancing the structure. While the structure remains correct
in terms of searchability, its performance can be significantly
decreased. In this paper, we propose a distributed self-stabilizing
algorithm to balance containment-based trees.

Index Terms — self-stabilization, balancing algorithms, 
containment-based trees 

I. Introduction

Several tree families are based on a containment relation. 
Examples include B+-trees [1], R-trees [2] and M-trees [3].
They are respectively designed to handle intervals, rectangles, 
and balls. Their logarithmic height ensures good performance 
for basic insertion/deletion/search primitives. Basically they 
rely on a partial order on node labels. They can be specified 
as follows: 

1) Tree nature. The graph is acyclic and connected. 

2) Containment relation. Every non-root node $n$ satisfies
$\mathit{label}(n) \subseteq \mathit{label}(\mathit{father}(n))$.

3) Bounded degrees. The root has between 2 and M 
children, each internal node has between m and M 
children ($M \geq 2m$).

4) Balanced shape. All leaves are at the same level. 
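
The invariants above can be checked mechanically on a small example. The following is only a sketch under stated assumptions: labels are closed integer intervals (as in a B+-tree), the tree is given as nested objects so that the tree nature holds by construction, and the names `Node`, `contains`, `leaf_depths` and `check_invariants` are hypothetical helpers introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: tuple                      # closed interval (low, high); an assumption of this sketch
    children: list = field(default_factory=list)


def contains(outer, inner):
    """True if interval `inner` lies inside interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur below `node`."""
    if not node.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


def check_invariants(root, m, M):
    """Check invariants 2 to 4 (containment, bounded degrees, balanced shape)."""
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        if node.children:
            low = 2 if is_root else m
            if not low <= len(node.children) <= M:
                return False          # invariant 3 violated
        for child in node.children:
            if not contains(node.label, child.label):
                return False          # invariant 2 violated
            stack.append((child, False))
    return len(leaf_depths(root)) == 1  # invariant 4: all leaves at the same level


leaves = [Node((i, i + 1)) for i in range(0, 8, 2)]   # (0,1) (2,3) (4,5) (6,7)
left, right = Node((0, 3), leaves[:2]), Node((4, 7), leaves[2:])
root = Node((0, 7), [left, right])
ok = check_invariants(root, m=2, M=4)

right.label = (5, 7)                  # no longer contains the leaf (4, 5)
broken = check_invariants(root, m=2, M=4)
```

In this example `ok` is true, and shrinking the label of `right` too far makes the containment check fail.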

In a distributed context, no node has global knowledge
of the system, as each node only has access to its local
information. As a consequence, the aforementioned invariants
should be expressed as "local" constraints, i.e., at the level of
a node. Operations preserving or restoring those invariants
should also be "as local as possible".

Preserving the tree nature has been addressed in previous
work [4], [5], [6]. It is tightly related to the distribution model
of the structure, the centralized case being far less stressing. 
The containment relation is easy to preserve as node labels 
can be "enlarged" or "shrunk". The bounded node degrees are 
ensured with split or collapse primitives when a node has too 
many or too few children, respectively. The balanced shape 
of the tree is especially important in terms of performance; in 
conjunction with node degree bounds, it ensures that the tree 
has logarithmic height. A number of approaches have been 
proposed in the literature in order to balance tree overlays. 
However, these solutions have many limitations in particular 
when applied to containment-based tree overlays. 

The works in [4], [6] have the extra property of being
self-stabilizing. A self-stabilizing system, as introduced in [7], is
guaranteed to converge to the intended behavior in finite time,
regardless of the initial states of nodes. Self-stabilization [7]
is a general technique to design distributed systems that can
handle arbitrary transient faults.

Figure 1 shows the overlay lifecycle borrowed from [6].
Rectangles refer to states of the distributed tree. Transitions
are labelled with the events (joins and faults) or algorithms (repair
and balance) triggering them¹. Initially the tree is empty,
then some peers join the overlay, building a searchable tree.
The balancing algorithm eventually balances the tree. In case
of joins the tree may become unbalanced. In case of faults the
tree may become disconnected. The repair algorithm eventually
reconnects the tree and fixes the containment relation. The
distinction between searchable and balanced states emphasizes
a separation of concerns between correction and performance.
Basically repair is about correction, while balance is about
performance. It is also interesting to point out that if the
balance algorithm is self-stabilizing, then the join algorithm
does not have to deal with performance. Moreover, if the
repair algorithm is self-stabilizing, a peer quitting the overlay
does not have to communicate with any other peer; departures
can be handled as faults.

¹ A balanced tree might remain balanced or might become searchable in
case of faults, while it might remain balanced in case of joins. However, for
the sake of readability, only the most stressing transitions (which are also the
most likely) are shown.

Fig. 1: Containment-based tree overlay lifecycle. The overlay is
balanced when all invariants are ensured, searchable when all but
the fourth invariant are ensured, and disconnected when at least the
tree nature invariant is violated.

A. Related Work 

DR-tree [4] is a distributed version of R-trees [2] developed
to build a brokerless peer-to-peer publish/subscribe system.
Each computer stores a leaf and some of its consecutive
ancestors. The node closest to the root stored by a computer is
called its "first node". Each computer knows some ancestors
of its first node, from its grandfather to the root. The tree
may become unbalanced in case of faults. When a computer
detects that the computer storing the father of its first node
has crashed, it broadcasts a message in the subtree starting from
its first node. All leaves belonging to that subtree will be
reinserted elsewhere starting from an ancestor of the crashed
peer's first node. As a consequence, no balancing primitive is
used. However, up to half of the nodes of the tree may have
to be reinserted (if faults occur near the root).

SDR-tree [8] is an indexing structure distributed amongst
a cluster of servers. Its main aim is to provide efficient
and scalable range queries. It builds balanced full binary
trees satisfying the containment relation. Each computer stores
exactly one leaf and one internal node. Instead of classic split
and collapse operations, it extensively uses subtree
heights to guarantee the balanced shape of the structure. The authors
adapt the concept of rotation [9] to multidimensional data.
However, they do not mention how to repair the structure in
case of crashes.

VBI [5], a sequel of BATON [10], is a framework to build
distributed spatial indexes on top of binary trees. It provides
default implementations for all purely structural concerns of
binary trees: maintaining father and children links, rotations,
etc. Developers only have to focus on several operations,
mostly related to node labeling (such as split and collapse).
However, despite the interesting approach of mapping any
tree onto a binary one, the framework does not provide a handy way to
parameterize the system. While modifying node degree bounds
should be a simple way to tune system performance, the fixed
degree of the core structure (a tradeoff between reusability and
performance) cannot be bypassed.

The solution proposed in [6] also deals with full binary trees
satisfying the containment relation. The distribution is the same as
in the SDR-tree: each computer stores exactly one leaf and one
internal node. The core contribution of that paper is to prove
that if the leaf and the internal node held by each machine are
randomly chosen, then the graph between computers (namely,
the communication graph) is very unlikely to be disconnected
in case of faults. More precisely, faults disconnect the logical
structure, but not the communication graph. That paper
proposes an algorithm exploiting the connectivity of the
communication graph to repair the logical structure, restoring the
tree nature, containment relation and bounded node degrees.
This approach is shown to be much cheaper than [4], both in
terms of recovery time and cost in messages. However, the
restored tree can be slightly unbalanced and the restoration of
its shape is not addressed in the paper.

The algorithm proposed in [6] has the desirable property
of being self-stabilizing. There exist several distributed,
self-stabilizing algorithms in the literature that maintain a global
property based on node labels, e.g., [11], [12], [13]. The
solutions in [11], [12] achieve the required property (resp.
Heap and BST) by reorganizing the keys in the tree. They have
no impact on the tree topology. The solution in [13] arranges
both node labels (or keys) and the tree topology. None of
the above self-stabilizing solutions deals with the balanced
property.

B. Contribution 

In this paper, we first show that edge swapping can be
used as a balancing primitive. Then we propose a distributed
self-stabilizing algorithm balancing any containment-based
tree, such as B+-trees [1], R-trees [2] and M-trees [3]. Our
algorithm can be used in [4], [6] to enhance the performance
of repaired trees, or in [5], [10] as a "core" balancing mechanism.
We further prove the correctness of the algorithm and
investigate its practical convergence speed via simulations.

C. Roadmap 

In Section II we argue that edge swapping is practically
better suited than rotations to balance containment-based trees.
In Section III we present the model we use to describe
our algorithm. In Section IV we propose a distributed
self-stabilizing algorithm that relies on edge swapping to balance
any containment-based tree. In the same section, we prove the
termination and correctness of the algorithm. In Section V
we investigate the practical termination time of our algorithm
via simulations. Section VI contains some concluding remarks
and possible directions for future work.

II. Balancing Primitive 

Let $\alpha \in \mathbb{N}$. A tree is balanced iff all its nodes are balanced.
A node is balanced iff the heights of any pair of its children
differ by at most $\alpha$. In the remainder of this paper, we make
two assumptions:

• $\alpha \geq 1$.

• each non-leaf node has at least two children.

The assumption on $\alpha$ is weaker than those of [8], [6], [5] and
still ensures logarithmic height of the tree [14]. The degree
assumption is also weaker than those of [8], [6], [5] and allows
more practical tree configurations.

Fig. 2: An example of edge swapping: (a) swapping (a, c) and (b, e) ... (b) ... to (a, e) and (b, c). The fathers of c and e are
modified. The label of b needs to be updated.

Basically, a balancing primitive is an operation that eventually
reduces the difference between the heights of some
subtrees by modifying several links of the tree. Those link
modifications may have some semantic impact if they break
node labeling invariants.

A. Rotation 

In BST [15] and AVL [9] trees, the well-known rotation primitive
is used to ensure the balanced shape of the tree. However,
when dealing with structures relying on partially ordered data,
rotations do have a semantic impact. Moreover, in a distributed
context, if a node $n$ and its father or grandfather concurrently
execute rotations, they may both "write" $\mathit{father}(n)$. It follows
that the use of distributed rotations requires synchronization
to preserve the tree structure.

B. Edge swapping 

Given two edges $(a, b)$ and $(c, d)$, swapping them consists
in exchanging their tails (resp. heads). Formally,
$\mathit{swap}((a, b), (c, d))$ modifies the edges of the graph as follows:
$E(G) := E(G) - \{(a, b), (c, d)\} + \{(a, d), (c, b)\}$. Figure 2
contains an illustration. With the algorithm that we present
in Section IV, concurrent swaps cannot conflict. The use of
this balancing primitive is thus more suitable in a distributed
context.
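
The edge-set definition of the swap can be tried out directly. The following is a minimal sketch, assuming edges are stored as (father, child) pairs in a set; the example tree and the helper `height` are hypothetical and only loosely follow Figure 2.

```python
def swap(edges, e1, e2):
    """Return E - {(a, b), (c, d)} + {(a, d), (c, b)} for e1 = (a, b), e2 = (c, d)."""
    if e1 not in edges or e2 not in edges:
        raise ValueError("both edges must belong to the graph")
    (a, b), (c, d) = e1, e2
    return (edges - {e1, e2}) | {(a, d), (c, b)}


def height(edges, node):
    """Actual height of `node` in the tree described by (father, child) pairs."""
    kids = [child for (father, child) in edges if father == node]
    return 0 if not kids else 1 + max(height(edges, k) for k in kids)


# Hypothetical tree in the spirit of Figure 2: a is the father of b and c,
# b is the father of d and e, and e has two leaf children f and g.
before = frozenset({("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"),
                    ("e", "f"), ("e", "g")})
after = swap(before, ("a", "c"), ("b", "e"))   # yields (a, e) and (b, c)
```

Here the shallow subtree c moves below b while the deep subtree e moves up below a, so the height of a drops from 3 to 2. Only the fathers of c and e change, which is what makes the primitive local.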

III. Model 

In this paper, we consider the classical local shared memory
model, known as the state model, that was introduced by
Dijkstra [7]. In this model, communications between neighbours
are modeled by direct reading of variables instead of exchange
of messages. The program of every node consists of a set of
shared variables (henceforth referred to as variables) and a finite
number of actions. Each node can write to its own variables
and read its own variables and those of its neighbours. Each
action is constituted as follows:

<Label> :: <Guard> → <Statement>

The guard of an action is a boolean expression involving 
the variables of a node u and its neighbours. The statement 
is an action which updates one or more variables of u. Note 
that an action can be executed only if its guard is true. Each 
execution is decomposed into steps. 

The state of a node u is defined by the value of its variables.
It consists of the following pieces of information:

• An integer value which we call the height value or height
information of node u.

• Two arrays that contain the IDs of the children of u and
their height values.

For the sake of generality, we do not make any assumptions
about the number of bits available for storing height information,
thus the height value of a node can be an arbitrarily large
(positive or negative) integer.

The configuration of the system at any given time t is 
the aggregate of the states of the individual nodes. We will 
sometimes use the term "configuration" to refer to the rooted 
tree formed by the nodes. 

Definition 1: We denote by $C(t)$ the configuration of the
system at time $t \geq 0$. For a node $u$, we denote by $h_u(t)$ the
value of its height variable in $C(t)$, by $h^*_u(t)$ its actual height
in the tree in $C(t)$, by $S_u(t)$ the set of children of $u$ in $C(t)$,
and by $S^*_u(t)$ the set of nodes in the subtree rooted at $u$ in $C(t)$.

Let $C(t)$ be a configuration at instant $t$ and let $I$ be an action
of a node $u$. $I$ is enabled for $u$ in $C(t)$ if and only if the guard
of $I$ is satisfied by $u$ in $C(t)$. Node $u$ is enabled in $C(t)$ if
and only if at least one action is enabled for $u$ in $C(t)$. Each
step consists of two sequential phases executed atomically:
(i) every node evaluates its guard; (ii) one or more enabled
nodes execute their enabled actions. When the two phases are
done, the next step begins. This execution model is known
as the distributed daemon [16]. To capture asynchrony, we
assume a semi-synchronous scheduler which picks any non-empty
subset of the enabled nodes in the current configuration
and executes their actions simultaneously. We do not make any
fairness assumptions, thus the scheduler is free to effectively
ignore any particular node or set of nodes as long as there
exists at least one other node that can be activated.
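
As an illustration of the scheduler described above, the sketch below picks an arbitrary non-empty subset of the enabled nodes. It is an assumption of this example that the choice is made uniformly at random with a seeded generator; the model itself allows any choice, including adversarial ones, and the function name `daemon_pick` is hypothetical.

```python
import random


def daemon_pick(enabled, rng):
    """Pick a non-empty subset of the enabled nodes (one possible scheduler choice)."""
    if not enabled:
        return set()            # nothing to activate: the execution is completed
    pool = sorted(enabled)      # sorted so that a seeded generator is reproducible
    size = rng.randint(1, len(pool))
    return set(rng.sample(pool, size))


rng = random.Random(0)
enabled = {"u1", "u2", "u3", "u4"}
activated = daemon_pick(enabled, rng)
```

A synchronous daemon, as used for the simulations in Section V, corresponds to the special case where the whole set `enabled` is always returned.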

Definition 2: We refer to the activation of a non-empty
subset $A$ of the enabled nodes in a given configuration as
an execution step. If $C$ is the configuration of the system
before the activation of $A$ and $C'$ is the resulting configuration,
we denote this particular step by $C \to_A C'$. An
execution starting from an initial configuration $C_0$ is a sequence
$C_0 \to_{A_1} C_1 \to_{A_2} C_2 \to \cdots$ of execution steps. Time is
measured by the number of steps that have been executed.
An execution is completed when it reaches a configuration in
which no node is enabled. After that point, no node is ever
activated.

If a node was enabled before a particular execution step, 
was not activated in that step, and is not enabled after that 
step, then we say that it was neutralized after that step. 

Definition 3: Given a particular execution, an execution 
round (or simply round) starting from a configuration C 
consists of the minimum-length sequence of steps in which 
every enabled node in C is activated or neutralized at least 
once. 

Remark 1: To simplify the presentation, we will assume 
throughout the rest of the paper that the arrays containing the 
IDs of the children of each node and copies of their height 
values are consistent with the height values stored by the 
children themselves in the current configuration of the system. 
It should be clear that maintaining these copies up to date can 
be achieved with a constant overhead per execution step. 

IV. Self-Stabilizing Balancing Solution 

In this section, we present our self-stabilizing algorithm for
balancing containment-based trees, provide some termination
properties, and prove that any execution converges to
a balanced tree.

Assuming that each node knows the correct heights of its 
subtrees, a very simple distributed self-stabilizing algorithm 
balances the tree: each node uses the swap operation whenever 
two of its children heights are "too different". However, in 
a distributed context, height information may be inaccurate. 
This inaccuracy could lead the aforementioned naive balancing 
algorithm to make some "wrong moves." For example, in
Figure 2, assume that c "thinks" that its own height is 4
and e "thinks" that its own height is 9 while their actual 
heights are respectively 10 and 8; the illustrated swap would 
actually unbalance the tree. On one hand, maintaining heights 
in a self-stabilizing fashion is easy and ensures that height 
information will eventually be correct. On the other hand, 
no node can know when height information is correct; as a 
consequence, height maintenance and balancing have to run 
concurrently. But their concurrent execution raises an obvious 
risk: the swap operation modifies the tree structure and could 
thus compromise the convergence of the height maintenance 
subprotocol. 

Basically, the algorithm that we propose in this section 
consists of two concurrent actions: one maintaining heights, 
the other one balancing the tree. We formalize both actions and 
prove that their concurrent execution converges to a balanced 
tree. 

A. Algorithm 

At any time t, each node u is able to evaluate the following 
functions and predicates for itself and its children: 

• max(x): returns any $v \in S_x(t)$ such that $h_v(t) = \max_{w \in S_x(t)} h_w(t)$. If $x$ is a leaf, returns $\bot$ (undefined).

• min(x): returns any $v \in S_x(t)$ such that $h_v(t) = \min_{w \in S_x(t)} h_w(t)$. If $x$ is a leaf, returns $\bot$.

• stable(x): returns true if and only if $h_x(t) = 1 + \max_{w \in S_x(t)} h_w(t)$. If $x$ is a leaf, returns true if and only
if $h_x(t) = 0$.

• balanced(x): returns true if and only if for all $z, z' \in S_x(t)$, $|h_z(t) - h_{z'}(t)| \leq \alpha$.

When there is more than one possible return value
for max(x) and min(x), an arbitrary choice is made. For
simplicity, we can consider that the candidate node with the 
smallest ID is returned, although this will not be crucial for 
our results. 
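
The four functions can be sketched as follows. This is only an illustration under assumptions: the configuration is represented by two dictionaries, `children` (node to list of children) and `h` (node to height value), `None` plays the role of the undefined value, ties are broken by smallest ID, and the names `max_child` and `min_child` are hypothetical stand-ins for max and min.

```python
ALPHA = 1  # balance parameter; the value 1 is chosen only for this example


def max_child(children, h, x):
    """Child of x with the largest height value (smallest ID on ties), or None for a leaf."""
    kids = children[x]
    return min(kids, key=lambda v: (-h[v], v)) if kids else None


def min_child(children, h, x):
    """Child of x with the smallest height value (smallest ID on ties), or None for a leaf."""
    kids = children[x]
    return min(kids, key=lambda v: (h[v], v)) if kids else None


def stable(children, h, x):
    """True iff the height value of x agrees with the height values of its children."""
    if x is None:
        return False                      # stable of the undefined value is taken to be false
    kids = children[x]
    return h[x] == (1 + max(h[v] for v in kids) if kids else 0)


def balanced(children, h, x, alpha=ALPHA):
    """True iff the height values of any two children of x differ by at most alpha."""
    vals = [h[v] for v in children[x]]
    return not vals or max(vals) - min(vals) <= alpha


children = {"a": ["b", "c"], "b": ["d", "e"], "c": [], "d": [],
            "e": ["f", "g"], "f": [], "g": []}
h = {"a": 3, "b": 2, "c": 0, "d": 0, "e": 1, "f": 0, "g": 0}
```

With these values, node a is stable but not balanced (its children have height values 2 and 0), while node b is both stable and balanced.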

Each node u executes the algorithm in Figure 3 (the value
of stable($\bot$) is assumed to be false). Note that the guards G1
and G2 are mutually exclusive and, by definition, the height
update action effectively has priority over the edge swapping
action. We say that a node is enabled for a height update if
G1 is true, or enabled for a swap if G2 is true.

When a node u performs a swap, four nodes are involved: u
itself, max(u), min(u), and max(max(u)). We refer to these
nodes as the source, the target, the swap-out, and the swap-in
nodes of the swap, respectively.
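
To make the interplay of the two actions concrete, here is a small sketch of the algorithm run under a synchronous daemon (every enabled node acts in every step). Several choices are assumptions of this sketch and not taken from the algorithm's specification: the tree is stored as a father map, ties are broken by smallest ID, all writes of a step are computed from the configuration at the start of the step, and a swap is skipped when the target is a leaf. The names `children_of`, `sync_step` and `run` are hypothetical.

```python
def children_of(father):
    """Children lists derived from a {node: father or None} map."""
    kids = {v: [] for v in father}
    for v, f in father.items():
        if f is not None:
            kids[f].append(v)
    return kids


def sync_step(father, h, alpha):
    """One synchronous step: every enabled node executes its enabled action."""
    kids = children_of(father)

    def hi(x):
        return min(kids[x], key=lambda v: (-h[v], v)) if kids[x] else None

    def lo(x):
        return min(kids[x], key=lambda v: (h[v], v)) if kids[x] else None

    def target_h(x):
        return 1 + max(h[v] for v in kids[x]) if kids[x] else 0

    def stable(x):
        return x is not None and h[x] == target_h(x)

    def balanced(x):
        return not kids[x] or h[hi(x)] - h[lo(x)] <= alpha

    new_father, new_h, active = dict(father), dict(h), False
    for u in father:
        if not stable(u):                              # guard G1, statement S1
            new_h[u] = target_h(u)
            active = True
        elif stable(hi(u)) and not balanced(u) and kids[hi(u)]:
            # Guard G2, statement S2. The last test (target is not a leaf)
            # is a safety check added in this sketch.
            tgt = hi(u)
            new_father[lo(u)] = tgt                    # swap-out moves below the target
            new_father[hi(tgt)] = u                    # swap-in moves below the source
            active = True
    return new_father, new_h, active


def run(father, h, alpha=1, max_steps=10_000):
    """Iterate synchronous steps until no node is enabled; returns (father, h, steps)."""
    for step in range(max_steps):
        father, h, active = sync_step(father, h, alpha)
        if not active:
            return father, h, step
    raise RuntimeError("no fixed point reached within max_steps")


father = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "e", "g": "e"}
h = {"a": 3, "b": 2, "c": 0, "d": 0, "e": 1, "f": 0, "g": 0}
final_father, final_h, steps = run(father, h)
```

On this input the first step is a swap at the root (c moves below b, e moves below a), and the next two steps are height updates at b and then a, after which no node is enabled.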

The following proposition can be proved directly from the 
definitions. We state it without proof. 

Proposition 1: If $C$ is a rooted tree with root $r$, then after
any execution step $C \to_A C'$, $C'$ is still a directed tree with
root $r$.

B. Termination Properties 

In the following section, we will prove that for any initial 
configuration, every possible execution is completed in a finite 
number of steps. For the moment, we give two properties of the 
resulting tree, assuming of course that the execution consists 
of a finite number of steps. 

The proof of the following proposition can be found in
Appendix A.

Proposition 2: If the execution is completed, then in the
final configuration $C(t^*)$ all nodes are balanced and have
correct height information.

The next proposition, regarding the height of the resulting
tree, follows directly from the analysis in [14, Sections II
and III]. In our case, the initial conditions of the recurrence
studied in [14] are slightly different, but this does not affect
the asymptotic behavior of the height.

Proposition 3: If the execution is completed in time $t^*$, then
in the final configuration $h^*_r(t^*) = O(\log n)$, where $r$ is the root
and $n$ is the number of nodes in the system.

C. Proof of Convergence 

We give an overview of the proof with references to the 
appropriate appendices for the technical parts. 

The concept of a "bad node" will be useful in the analysis 
of the algorithm. 

Definition 4 (Bad nodes): In a given configuration $C(t)$, an
internal node $u$ is a bad node if $h_u(t) < \max_{v \in S_u(t)} h_v(t)$. A
leaf $u$ is a bad node if $h_u(t) < 0$.

Intuitively, a bad node is a node that "wants" to increase its
height value. The proof of the following key lemma can be
found in Appendix B.

G1 (guard): $\neg$stable(u)
S1 (statement): $h_u := 1 + \max_{v \in S_u} h_v$ (or $h_u := 0$ if u is a leaf)

G2 (guard): stable(u) $\wedge$ stable(max(u)) $\wedge$ $\neg$balanced(u)
S2 (statement): swap edges (u, min(u)) and (max(u), max(max(u)))

Fig. 3: Distributed self-stabilizing balancing algorithm




Lemma 4: If $C(t)$ contains no bad nodes, then for all $t' \geq t$,
$C(t')$ contains no bad nodes.

It will be convenient to view any execution of the algorithm 
as consisting of two phases: The first phase starts from the 
initial configuration and ends at the first configuration in which 
the system is free of bad nodes. The second phase starts at 
the end of the first phase and ends at the first configuration 
in which no node is enabled, i.e., at the end of the execution. 
In view of Lemma 4, the system does not contain any bad
nodes during the second phase. We will prove that each phase 
is concluded in a finite number of steps, starting from the 
second phase. 

1) Second Phase: We prove convergence for the second 
phase by bounding directly the number of height updates and 
the number of swaps that may occur during that phase. The 
fact that there are no bad nodes in the second phase is crucial 
for bounding the number of height updates. It follows that the 
number of swaps also has to be bounded, since a long enough 
sequence of steps in which only swaps are performed incurs 
more height updates. The detailed proofs of these claims can 
be found in Appendix C.

Lemma 5: Starting from a configuration with no bad nodes, 
no execution can perform an infinite number of height updates. 

Lemma 6: Starting from a configuration with no bad nodes, 
no execution can perform an infinite number of swaps. 

Lemmas 5 and 6 directly imply the following:

Theorem 7: In any execution, the second phase is completed
in a finite number of steps.

2) First Phase: For the sake of presentation, it will be
helpful to sometimes consider that the root of the tree has
an imaginary father $\hat{r}$, which is never enabled and always a
bad node.

Definition 5 (Extended configuration): We denote by $\hat{C}(t)$
an auxiliary extended configuration at time $t$, which is identical
to $C(t)$ except that the root node in $C(t)$ has a new father node $\hat{r}$
with $h_{\hat{r}}(t) = -\infty$, for all $t \geq 0$.

The bad nodes induce a partition of the nodes of the extended
configuration into components: each bad node belongs to a
different component, and each non-bad node belongs to the
component that contains its nearest bad ancestor.

Definition 6 (Partition into components): For each bad
node $b$ in the extended configuration $\hat{C}(t)$, the component $T_b(t)$ is the maximal weakly
connected directed subgraph of $\hat{C}(t)$ that has $b$ as its root and
contains no other bad nodes.

Fig. 4: An illustration of the notions introduced in Definitions 4,
5, 6 and 7. Node labels indicate their height values. Circled nodes
represent bad nodes. The dashed edge exists only in the extended
configuration and connects the real root of the tree to the artificial
node $\hat{r}$. Each group of nodes is one component of the partition. The
badness vector corresponding to this configuration is (3, 7, 2, 1, 4).

A useful property of this partition is that it remains unchanged
as long as the set of bad nodes remains the same.
Therefore, in any sequence of steps in which the set of bad
nodes remains the same, each component behaves similarly
to a system that does not contain bad nodes. In particular,
Lemmas 5 and 6 imply that each component is stabilized in
finite time, and thus this sequence of steps cannot be infinite.
For a complete substantiation of these claims, please refer to
the proof of the following lemma in Appendix D.

Lemma 8: There cannot be an infinite sequence of steps in 
which no bad node is activated or becomes non-bad. 

We associate a badness vector with each configuration. This 
vector reflects the distribution of bad nodes in the system and 
will serve to quantify a certain notion of progress toward the 
extinction of bad nodes. In particular, we will prove that the 
badness vector decreases lexicographically in every step in 
which at least one bad node is activated or becomes non-bad. 

Definition 7 (Badness vector): For $t \geq 0$, let
$b_1, b_2, \ldots, b_{|B(t)|}$ be an ordering of the bad nodes in the extended configuration $\hat{C}(t)$ by
non-decreasing number of bad nodes contained in the path
from the artificial root $\hat{r}$ to $b_i$, breaking ties arbitrarily. Note that $b_1 = \hat{r}$. We
define the badness vector at time $t \geq 0$ to be the vector
$\vec{b}(t) = (|T_{b_1}(t)|, |T_{b_2}(t)|, \ldots, |T_{b_{|B(t)|}}(t)|)$,
where the size of a connected component is the number of
nodes belonging to that component.

We refer the reader to Figure 4 for an example of an extended
configuration, its partition into components, and the
corresponding badness vector.

Definition 8 (Lexicographic ordering): Consider two badness
vectors $\vec{b} = (x_1, \ldots, x_k)$ and $\vec{b}' = (x'_1, \ldots, x'_{k'})$. We
say that $\vec{b}'$ is lexicographically smaller than $\vec{b}$ if one of the
following holds:

1) $k' < k$, or

2) $k' = k$ and for some $i$ in the range $1 \leq i \leq k$, $x'_i < x_i$
and $x'_j = x_j$ for all $j < i$.
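
The ordering of Definition 8 differs from the usual lexicographic order on sequences in that the length is compared first. A minimal sketch, assuming badness vectors are represented as tuples of integers (the function name `lex_smaller` and the example vectors other than the one of Figure 4 are hypothetical):

```python
def lex_smaller(b_new, b_old):
    """True iff b_new is lexicographically smaller than b_old in the sense of Definition 8."""
    if len(b_new) != len(b_old):
        return len(b_new) < len(b_old)      # case 1: fewer entries
    return tuple(b_new) < tuple(b_old)      # case 2: same length, first difference decides


reference = (3, 7, 2, 1, 4)                 # the badness vector of Figure 4
shorter = lex_smaller((3, 7, 2, 1), reference)
first_diff = lex_smaller((3, 6, 9, 9, 9), reference)
same = lex_smaller(reference, reference)
longer = lex_smaller((1, 1, 1, 1, 1, 1), reference)
```

The last case shows why Python's built-in tuple comparison cannot be used directly: `(1, 1, 1, 1, 1, 1) < (3, 7, 2, 1, 4)` holds for plain tuples, but the longer vector is not smaller under Definition 8.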
The proof of the following lemma can be found in
Appendix E.

Lemma 9: If at least one bad node is activated or becomes
non-bad in step $C(t) \to C(t+1)$, then $\vec{b}(t+1)$ is
lexicographically smaller than $\vec{b}(t)$.

Lemmas 8 and 9 are the essential ingredients required to
prove convergence for the first phase. Appendix F contains
the full proof.

Theorem 10: In any execution, the first phase is completed 
in a finite number of steps. 

V. Simulation 

To investigate the dynamic behavior and properties of our
algorithm, we implemented a round-based simulator. Each
simulation (i) builds a full binary tree, (ii) initializes heights,
and (iii) runs the balancing protocol. We used a synchronous
daemon to run simulations, i.e., all the enabled nodes of the
system execute their enabled actions simultaneously. The impact
of non-deterministic daemons and/or the weaker daemon
that is required to run our algorithm will be tackled in future
work.

In the following, for a given simulation we will denote by $n$
the number of nodes of the tree, by $h_i$ its initial height (i.e., after
generation), by $h_f$ its final height (i.e., after balancing) and
by $t$ the execution time in rounds.

A. Almost Linear Trees 

Intuitively, almost linear trees are stressing for a balancing
algorithm because they are "as unbalanced as possible". In
the following, for a given $n$, the initial tree is the structurally
unique full binary tree of height $\lfloor n/2 \rfloor$. The only unspecified
part of the simulation is the initial height values. It turns out
that this has practically a very small impact on termination
time. For each $n$ we ran thousands of simulations starting
from the corresponding linear tree. For a given $n$ they always
converged in the same number of rounds. As a consequence,
all runs starting from the same linear tree will have approximately
the same results.
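
For readers who want to reproduce this setting, an almost linear tree can be generated as below. This is a sketch and not the simulator used for the experiments: it assumes nodes are numbered 0 to n - 1, that n is odd (every internal node of a full binary tree has exactly two children), and that the tree is stored as a father map; the function names are hypothetical.

```python
def almost_linear_tree(n):
    """Father map of the full binary tree with n nodes (n odd) whose height is n // 2."""
    if n < 1 or n % 2 == 0:
        raise ValueError("a full binary tree has an odd number of nodes")
    father = {0: None}
    spine = 0
    for v in range(1, n, 2):
        father[v] = spine        # child that continues the spine
        father[v + 1] = spine    # its sibling, which stays a leaf
        spine = v
    return father


def tree_height(father):
    """Actual height of the tree; relies on fathers being inserted before their children."""
    depth = {}
    for v in father:
        depth[v] = 0 if father[v] is None else depth[father[v]] + 1
    return max(depth.values())


tree = almost_linear_tree(7)
```

For n = 7 this yields a spine 0, 1, 3, 5 with one leaf hanging off each spine node, so the height is 3.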

Figure 5 shows the termination time of simulations for
different numbers of nodes. It contains two curves: the first
one is plotted from our experiment and each point stands for
an almost linear tree. The second curve is the sum of initial
and final tree heights.

This plot suggests that the round complexity of our
algorithm is $O(n)$ in the worst case.

B. Random Trees 

To showcase the applicability of our algorithm to an existing
system, trees are generated using the join protocol of [6].

Figure 6 shows the distance between the sum of initial
and final tree heights and experimental termination times. The
vertical axis gives the average variation between $h_i + h_f$ and
experimental termination times. For each $n$ we ran thousands
of simulations. Each candlestick sums up statistics on those
runs: the whiskers indicate the minimum and maximum variation,
the cross indicates the average variation and the box height
indicates the standard deviation.

Fig. 5: $t$ for almost linear trees

Fig. 6: $t$ for random trees

The greater $n$ is, the closer $h_i + h_f$ and experimental results
are. This result indicates that the average round complexity of
our algorithm is $O(h_i + h_f)$.

VI. Concluding Remarks 

In this work, we propose a new distributed self-stabilizing
algorithm to rebalance containment-based trees using edge
swapping. Simulation results indicate that the algorithm is
quite efficient in terms of round complexity; in fact, it seems
that we can reasonably expect $O(n)$ to be a worst-case bound,
whereas in the average case the running time is closer to
$O(h_i + h_f)$ rounds. Interestingly, this average-case bound also
appears in a different setting in [13]. Note that the conjectured
average-case bound is close to $O(\log n)$ in a practically relevant
scenario in which some faults appear (or new nodes are
inserted) in an already balanced tree.

We have assumed that nodes keep correct copies of the
height values of their children, so that each node can read the
height values of its grandchildren by looking at the memory
of its children. For simplicity, we have not dealt with the
extra synchronization that would be required to maintain these
copies up-to-date, but it should be possible to achieve this with
a constant overhead per execution step. Furthermore, we have
assumed that internal nodes have degree at least two. Degenerate
internal nodes with degree one could be accommodated
by a bottom-up protocol that runs in parallel and essentially
disconnects them from the tree, attaching their children to their
parents. Finally, note that in Section II we remarked that edge
swaps may have semantic impact if they rearrange nodes so as
to violate the containment relation. This can also be fixed by
another bottom-up protocol that restores each node's label to
the minimum that suffices to contain the labels of its children.

Possible directions for future work include establishing the
conjectured upper bounds of $O(n)$ and $O(h_i + h_f)$ for the round
complexity. We already have some preliminary results in this
direction: the first phase of the algorithm (refer to Subsection IV-C)
is indeed concluded in $O(n)$ rounds. An extension
of this work would be to adapt the proposed algorithm to the
message passing model.

Appendix

A. Proof of Proposition \2\ 

We prove that each node $u$ is balanced and has correct 
height information by induction on the actual height $h^*_u(t^*)$ of $u$ 
in $C(t^*)$. 

If $h^*_u(t^*) = 0$, then $u$ is a leaf and since it is not enabled, 
$h_u(t^*) = 0$. Therefore, $u$ has correct height information and 
is trivially balanced. Assume that all nodes of actual height $k$ 
or less, where $k \ge 0$, are balanced and have correct height 
information in $C(t^*)$. Consider a node $u$ at actual height $k + 1$. 
Since $u$ is not enabled for a height update, 
$h_u(t^*) = \max_{v \in S_u(t^*)} h_v(t^*) + 1$. But all children of $u$ are at actual 
height $k$ or less, therefore the inductive hypothesis and the last 
equality imply: $h_u(t^*) = \max_{v \in S_u(t^*)} h^*_v(t^*) + 1 = h^*_u(t^*)$. 
Moreover, $u$ is not enabled for a swap, which means that for 
all children $v$ of $u$, $h_v(t^*) \ge h_u(t^*) - 1 - a$, and because they 
have correct height information, $h^*_v(t^*) \ge h^*_u(t^*) - 1 - a$. 
Therefore, $u$ has correct height information and is balanced. 
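
The height part of this induction can be illustrated with a minimal sketch on a fixed tree (no swaps). It assumes that the height-update guard is "the stored height differs from one plus the maximum stored height of the children, or from 0 for a leaf"; the function names, the dictionary-based tree encoding and the sequential scheduler are hypothetical choices made for the illustration, not part of the algorithm's specification.

```python
from typing import Dict, List


def actual_heights(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Actual height h*_u: 0 for a leaf, else 1 + max over the children."""
    result: Dict[str, int] = {}

    def visit(u: str) -> int:
        kids = children.get(u, [])
        result[u] = 0 if not kids else 1 + max(visit(v) for v in kids)
        return result[u]

    visit(root)
    return result


def expected_height(u: str, children: Dict[str, List[str]], h: Dict[str, int]) -> int:
    """Value a height update would write, based on the children's stored heights."""
    kids = children.get(u, [])
    return 0 if not kids else 1 + max(h[v] for v in kids)


def enabled_for_height_update(u, children, h) -> bool:
    """Assumed guard: the stored height differs from the expected one."""
    return h[u] != expected_height(u, children, h)


def run_height_updates(children, h):
    """Activate one enabled node at a time until no node is enabled."""
    h = dict(h)
    while True:
        enabled = [u for u in h if enabled_for_height_update(u, children, h)]
        if not enabled:
            return h
        u = enabled[0]
        h[u] = expected_height(u, children, h)


# A small tree with arbitrary (corrupted) initial height variables.
children = {"r": ["a", "b"], "a": ["c", "d"], "c": ["e"]}
corrupted = {"r": 7, "a": -2, "b": 3, "c": 0, "d": 5, "e": 1}
stable = run_height_updates(children, corrupted)
```

Once no node is enabled, the stored heights coincide with the actual heights, which is the "correct height information" half of the statement; the balance half additionally depends on the swap guard, which this sketch does not model.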

B. Proof of Lemma 4 

We will use the following notation in this section and in 
Appendices D and E. Let $P = C \to_A C'$ be a step of 
the execution. The set $A$ is partitioned into subsets $A_h$, $A_b$, 
and $A_s$, where $A_h$ is the set of non-bad nodes which perform 
a height update, $A_b$ is the set of bad nodes which perform a 
height update, and $A_s$ is the set of nodes which are sources of 
a swap. Let $A_t$ be the set of nodes which are targets of a swap 
and let $B$ and $B'$ denote the set of bad nodes in configuration $C$ 
and $C'$, respectively. For each node $u$, let $S_u$ be the set of 
children of $u$ in $C$ and $S^*_u$ be the set of nodes of the subtree 
rooted at $u$ in $C$, and let $S'_u$ and $S^{*\prime}_u$ be the corresponding sets 
in $C'$. Finally, let $\mathcal{G}(A_b)$ be the subgraph induced by $A_b$ in the 
configuration $C$. 

The following lemma states some very basic properties 
that are easily derived from the definitions. It will be used 
implicitly throughout the proofs. We state it without proof. 

Lemma 11 (Easy properties): 

1) $A_b \subseteq B$. 

2) $A_s \cap A_h = A_s \cap B = A_t \cap A_h = A_t \cap B = \emptyset$. 

3) For all nodes $u$, $S'_u = S_u$ if and only if $u \notin A_s \cup A_t$. 

4) The set of leaves in $C$ is equal to the set of leaves in $C'$. 

Lemma 12: Let $\mathcal{E}$ be a weakly connected component 
of $\mathcal{G}(A_b)$. In $C'$, the nodes of $\mathcal{E}$ still induce a weakly connected 
subgraph which is identical to $\mathcal{E}$. No leaf of $\mathcal{E}$ belongs to $B'$. 

Proof: No node of $\mathcal{E}$ is in $A_s \cup A_t$, therefore $S'_v = S_v$ 
for all nodes $v$ of $\mathcal{E}$. This suffices to prove that the nodes of $\mathcal{E}$ 
induce in $C'$ a weakly connected subgraph that is identical 
to $\mathcal{E}$. 

Now, let $u$ be a leaf of $\mathcal{E}$. If $u$ is also a leaf in $C$, then its 
activation has set its height variable to 0 in $C'$. But $u$ is still 
a leaf in $C'$, therefore it now has the correct height value and 
therefore $u \notin B'$. If $u$ is not a leaf of $C$, it means that $S_u$ is 
non-empty and $S_u \cap A_b = \emptyset$ (if $v \in S_u \cap A_b$, then $v$ would also 
be in $\mathcal{E}$ and $u$ would not be a leaf of $\mathcal{E}$). We deduce that each 
node in $S_u$ either decreases or retains its height variable in $C'$. 
Moreover, by definition of the height update action, node $u$ 
adjusts its height variable to at least one greater than any of the 
values of the height variables of its children in $C$. We conclude 
that $u \notin B'$. ■ 

Lemma 13: If $u \notin A \cup B$ and $S'_u \cap A_b = \emptyset$, then $u \notin B'$. 
Proof: Since $u$ is not activated, the value of its height 
variable remains the same in $C'$. Assume, first, that $u \notin A_t$. 
Then, $S'_u = S_u$ (since $u \notin A_s$, either). Moreover, $S'_u \cap A_b = \emptyset$, 
therefore all nodes in $S_u = S'_u$ either decreased or retained 
their height variables. From these observations and the fact 
that $u \notin B$, we conclude that $u \notin B'$. 

Now, assume that $u \in A_t$, and let $x \notin A_t$ be its new 
child in $C'$. By the definition of the swap guard, the original 
height variable of node $x$ is strictly smaller than the height 
variable of $u$, and it may have decreased even further if $x$ was 
simultaneously activated for a height update. All other children 
of $u$ in $C'$ were also children of $u$ in $C$, and, by the fact that 
$S'_u \cap A_b = \emptyset$, we know that they either decreased or retained 
their height values. From these observations and the fact that 
$u \notin B$, we have that $u \notin B'$. ■ 

Lemma 14: If $u \in A_h$ and $S'_u \cap A_b = \emptyset$, then $u \notin B'$. 
Proof: Since $u \in A_h$, we know that $u \notin A_s \cup A_t$ and 
therefore $S'_u = S_u$. Since $S'_u \cap A_b = \emptyset$, all its children either 
retain or decrease their height variable, whereas the height 
variable of $u$ is adjusted to at least one greater than any of the 
original values of the height variables of its children. Thus, 
$u \notin B'$. ■ 

Lemma 15: If $u \in A_s$ and $S'_u \cap A_b = \emptyset$, then $u \notin B'$. 
Proof: Let $x \notin A_b$ be the new child of $u$ in $C'$. By 
definition of the swap guard, we know that the height variable 
of $x$ in $C'$ is strictly smaller than that of $u$. All other children 
of $u$ were also children of $u$ in $C$ and they have either retained 
or decreased their height variable, therefore $u \notin B'$. ■ 

Lemma 16: $|B' \setminus B| \le |A_b \setminus B'|$. 

Proof: To each node $x \in B' \setminus B$, we associate a 
node $y \in A_b \setminus B'$ as follows: By Lemmas 13, 14, and 15, a 
non-bad node $x$ in $C$ can be turned into a bad node in $C'$ only if, 
in $C'$, it is the father of some node $b \in A_b$. In fact, this node $b$ 
must be the root of some weakly connected component $\mathcal{E}$ 
of $\mathcal{G}(A_b)$: Otherwise, $b$ would have a father $u \in A_b$ in $C$, 
and by Lemma 12, $u$ would still be the father of $b$ in $C'$, thus 
$u = x$. This contradicts the fact that $x \notin B$. We define $y$ 
to be any leaf of $\mathcal{E}$, which, by Lemma 12, became non-bad 
in $C'$. 

To conclude the argument, note that two distinct nodes 
$x, x' \in B' \setminus B$ must be the fathers of the roots of two distinct 
components of $\mathcal{G}(A_b)$, and therefore they are associated with two 
distinct nodes $y$ and $y'$. ■ 

Lemma 17: $|B'| \le |B|$. 

Proof: One can easily verify that, for any sets $B$ and $B'$, 
$|B'| - |B| = |B' \setminus B| - |B \setminus B'|$. Since $A_b \subseteq B$, we have that 
$|A_b \setminus B'| \le |B \setminus B'|$, therefore $|B'| - |B| \le |B' \setminus B| - |A_b \setminus B'|$. 

This, combined with Lemma 16, yields $|B'| \le |B|$. ■ 
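
The counting in this proof can be checked on concrete sets. The following is a small sketch using Python sets; the function name and the example sets are hypothetical, and the inequality of Lemma 16 is taken as a hypothesis rather than derived.

```python
def lemma17_counting(B: set, B_next: set, A_b: set) -> bool:
    """Check the set arithmetic of Lemma 17 for one step.

    B      : bad nodes before the step
    B_next : bad nodes after the step (B' in the text)
    A_b    : bad nodes that performed a height update (must be a subset of B)
    """
    assert A_b <= B, "A_b must be a subset of B"
    gained = B_next - B          # B' \ B
    lost = B - B_next            # B \ B'
    fixed = A_b - B_next         # A_b \ B'

    # Identity valid for any two finite sets.
    assert len(B_next) - len(B) == len(gained) - len(lost)
    # Consequence of A_b being a subset of B.
    assert len(fixed) <= len(lost)

    # Hypothesis supplied by Lemma 16; it is not derived here.
    if len(gained) > len(fixed):
        raise ValueError("example violates the hypothesis of Lemma 16")
    return len(B_next) <= len(B)


# Node 1 is repaired by its height update while node 5 becomes bad.
result = lemma17_counting(B={1, 2, 3, 4}, B_next={2, 3, 4, 5}, A_b={1, 2})
```

In this example one node is gained and one is lost, so the number of bad nodes is unchanged, which is consistent with the non-strict inequality of the lemma.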

Lemma 4 follows immediately from Lemma 17. 

C. Missing Proofs from Subsection IV-C 

Lemma 18: For any node $u$, if $h_u(t) < h^*_u(t)$, then there 
exists in $C(t)$ at least one bad node in the subtree rooted at $u$. 

Proof: By induction on the actual height of $u$. If $h^*_u(t) = 0$, 
then $u$ is a leaf and, by assumption, $h_u(t) < h^*_u(t) = 0$. 
Therefore, $u$ is a bad node. Now, assume that the statement 
holds for all nodes with actual height at most $k$, where $k \ge 0$. 
Consider a node $u$ with $h^*_u(t) = k + 1$ and let $v$ be one of its 
children with actual height $h^*_v(t) = k$ (at least one such child 
must exist). We can assume that $u$ is not bad, otherwise the 
claim is proved. Since $u$ is not bad, $h_v(t) \le h_u(t) - 1$. By 
assumption, $h_u(t) \le h^*_u(t) - 1$, thus we get 
$h_v(t) \le h^*_u(t) - 2 = k - 1$. Since the actual height of $v$ is $k$, we have 
$h_v(t) < h^*_v(t)$ and, by the inductive hypothesis, there exists a bad node 
in the subtree rooted at $v$, and hence in the subtree rooted at $u$. ■ 
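
The statement of Lemma 18 can be checked exhaustively on a small tree. The sketch below assumes a definition of badness that is consistent with how the proof uses it: a leaf is bad if its height variable is negative, and an internal node is bad if some child's height variable exceeds the node's own height variable minus one. This definition, the function names and the tree are assumptions made for the illustration.

```python
from itertools import product
from typing import Dict, List


def subtree(u: str, children: Dict[str, List[str]]) -> List[str]:
    """All nodes of the subtree rooted at u (including u)."""
    nodes = [u]
    for v in children.get(u, []):
        nodes.extend(subtree(v, children))
    return nodes


def actual_height(u: str, children: Dict[str, List[str]]) -> int:
    kids = children.get(u, [])
    return 0 if not kids else 1 + max(actual_height(v, children) for v in kids)


def is_bad(u: str, children: Dict[str, List[str]], h: Dict[str, int]) -> bool:
    """Assumed badness test (see the lead-in)."""
    kids = children.get(u, [])
    if not kids:
        return h[u] < 0
    return any(h[v] > h[u] - 1 for v in kids)


def lemma18_holds(children: Dict[str, List[str]], h: Dict[str, int]) -> bool:
    """If h_u < h*_u, some node in the subtree of u must be bad."""
    for u in h:
        if h[u] < actual_height(u, children):
            if not any(is_bad(w, children, h) for w in subtree(u, children)):
                return False
    return True


# Exhaustive check over all height assignments in {-1, ..., 3} on a 4-node tree.
children = {"r": ["a", "b"], "a": ["c"]}
nodes = ["r", "a", "b", "c"]
all_hold = all(
    lemma18_holds(children, dict(zip(nodes, values)))
    for values in product(range(-1, 4), repeat=len(nodes))
)
```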

Lemma 19: In the second phase, if a node becomes enabled 
for a height update, it will remain enabled for a height update 
at least until it is activated. 

Proof: Note that a node that is enabled for a height update 
cannot be the source or the target of a swap, therefore its set 
of children does not change while it is enabled for a height 
update. Moreover, its children cannot increase their own height 
values, so the node will remain enabled for a height update at 
least until its activation. ■ 
Proof of Lemma 5: Consider an execution of the algorithm 
in which an infinite number of height updates are 
executed. Since the number of nodes is finite, at least one node 
must execute a height update an infinite number of times. By 
the fact that the initial configuration contains no bad nodes and 
by Lemma 4, each time that node executes a height update, its 
height variable decreases. At some point, its height variable 
will become negative and at that point, by Lemma 18, some 
node in its subtree will become bad. This contradicts 
Lemma 4. ■ 
Proof of Lemma 6: By Lemma 5, there exists a finite 
time $t_0$ after which no height updates are performed. For each 
node $u$, let $h_u$ denote the value of its height variable at time $t_0$ 
and, since it remains constant thereafter, at all subsequent 
times. Furthermore, for $t \ge t_0$, let $\hat{S}_u(t)$ denote the set of 
children of $u$ at time $t$ whose height variable is equal to $h_u - 1$. 

Note that $u$ is enabled for a height update at time $t$ if and 
only if $|\hat{S}_u(t)| = 0$. We observe now that in every step, if $u$ is 
the target of a swap then $|\hat{S}_u(t)|$ is decreased by 1, otherwise 
it remains the same. By Lemma 19, if $|\hat{S}_u(t)|$ becomes 0 then 
it remains equal to 0 until $u$ performs a height update. 

Suppose, now, for the sake of contradiction, that an infinite 
number of swaps are performed after time $t_0$. For each 
swap, there exists a node that is the target of that swap. It 
follows, then, that after at most $\sum_u |\hat{S}_u(t_0)|$ swaps have been 
performed after time $t_0$, all nodes in the system will be either 
idle or enabled for a height update (idle nodes will include 
nodes that are so low in the tree that they cannot possibly 
be the source of a swap and the root of the tree which will 
not be able to perform a swap since all of its children will 
be enabled for a height update). At that point, either all nodes 
are idle and thus the execution is completed, which contradicts 
the fact that an infinite number of swaps are performed 
after time $t_0$, or the only choice of the scheduler is to activate 
a node for a height update, which contradicts the fact 
that no height updates are performed after time $t_0$. ■ 
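
The potential $\sum_u |\hat{S}_u(t)|$ used in this argument can be sketched as follows. The swap model is an assumption made for the illustration: the source exchanges one of its children (the swap-out) with a child of the target (the swap-in), the target being itself a child of the source, and height variables are left untouched. The function names and the example tree are hypothetical.

```python
from typing import Dict, Set


def potential(children: Dict[str, Set[str]], h: Dict[str, int]) -> int:
    """Sum over all nodes u of the number of children v with h_v == h_u - 1."""
    return sum(1 for u, kids in children.items() for v in kids if h[v] == h[u] - 1)


def swap(children, source, target, swap_in, swap_out):
    """Assumed swap: swap_in moves up to the source, swap_out moves down to the target."""
    assert target in children[source] and swap_out in children[source]
    assert swap_in in children[target] and swap_out != target
    new = {u: set(kids) for u, kids in children.items()}
    new[source].remove(swap_out)
    new[source].add(swap_in)
    new[target].remove(swap_in)
    new[target].add(swap_out)
    return new


# Unbalanced tree: u's child x is a leaf while u's child v has height 2.
children = {
    "u": {"v", "x"}, "v": {"w", "y"}, "w": {"z"},
    "x": set(), "y": set(), "z": set(),
}
h = {"u": 3, "v": 2, "w": 1, "x": 0, "y": 0, "z": 0}

before = potential(children, h)
after_swap = swap(children, source="u", target="v", swap_in="w", swap_out="x")
after = potential(after_swap, h)
```

In this example the swap lowers the potential by one, and the target `v` is left with no child of height $h_v - 1$, so it becomes enabled for a height update, as described in the proof.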

D. Proof of Lemma 8 

In this section, we use some notation introduced in Appendix B. 
Additionally, in this section and in Appendix E 
we will use the following notation. Let $\bar{C}$ and $\bar{C}'$ denote the 
extended configurations corresponding to $C$ and $C'$, and let $B$ 
and $B'$ denote the corresponding sets of bad nodes. Moreover, 
let $\{T_b\}_{b \in B}$ and $\{T'_b\}_{b \in B'}$ be the partition of nodes into 
components for the two configurations, and if $T$ is any such 
component, let $V(T)$ denote its set of nodes. 

Definition 9 (Swap chain): A swap chain in step $P$ is a 
maximal-length directed path $u_0, \ldots, u_\sigma$ ($\sigma \ge 1$) in 
configuration $C$ such that $u_0, \ldots, u_{\sigma-1} \in A_s$, $u_1, \ldots, u_\sigma \in A_t$, and, 
if $\sigma \ge 2$, $u_2, \ldots, u_\sigma$ are the swap-ins of the swaps performed 
by nodes $u_0, \ldots, u_{\sigma-2}$, respectively. 

Lemma 20 (Properties of swap chains): Let $u_0, \ldots, u_\sigma$ be 
a swap chain in step $P$. 

1) 

2) In $C'$, the nodes of even order in the swap chain 
($u_0, u_2, \ldots$) induce a directed path starting from $u_0$. 
Similarly, the nodes of odd order in the swap chain 
($u_1, u_3, \ldots$) induce a directed path starting from $u_1$. 

3) $\bigcup_{i=0}^{\sigma} S'_{u_i} = \bigcup_{i=0}^{\sigma} S_{u_i}$. 

Proof: Property 1 follows immediately from the fact that 
$u_0$ is the first node in the swap chain, therefore it is not the 
target of any swap operation. Property 2 follows from the fact 
that $u_2, u_4, \ldots$ are the swap-in nodes for the swaps performed 
by nodes $u_0, u_2, \ldots$ respectively and, similarly, $u_3, u_5, \ldots$ 
are the swap-in nodes for the swaps performed by nodes 
$u_1, u_3, \ldots$ respectively. For Property 3, note that the nodes 
in the swap chain exchange children only with other nodes in 
the swap chain. ■ 
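
Properties 2 and 3 can be illustrated on a swap chain of length $\sigma = 2$. The sketch assumes the following swap model, which is not spelled out in this appendix: a swap is a tuple (source, target, swap-in, swap-out) in which the swap-in is a child of the target and moves up to the source, while the swap-out is a child of the source and moves down to the target; all swaps of a step are evaluated against the original configuration and applied simultaneously. Node and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Swap = Tuple[str, str, str, str]  # (source, target, swap_in, swap_out)


def apply_swaps(children: Dict[str, Set[str]], swaps: List[Swap]) -> Dict[str, Set[str]]:
    """Apply all swaps of one step simultaneously (assumed swap model)."""
    new = {u: set(kids) for u, kids in children.items()}
    for source, target, swap_in, swap_out in swaps:
        # Guards are checked against the configuration before the step.
        assert target in children[source] and swap_out in children[source]
        assert swap_in in children[target] and swap_out != target
        new[source].discard(swap_out)
        new[source].add(swap_in)
        new[target].discard(swap_in)
        new[target].add(swap_out)
    return new


def children_union(children: Dict[str, Set[str]], chain: List[str]) -> Set[str]:
    """Union of the children sets of the nodes of the chain."""
    result: Set[str] = set()
    for u in chain:
        result |= children[u]
    return result


# Chain u0 -> u1 -> u2: u0 and u1 are sources, u1 and u2 are targets,
# and u2 is the swap-in of the swap performed by u0.
children = {
    "u0": {"u1", "x0"}, "u1": {"u2", "x1"}, "u2": {"u3"},
    "u3": set(), "x0": set(), "x1": set(),
}
swaps = [("u0", "u1", "u2", "x0"), ("u1", "u2", "u3", "x1")]
chain = ["u0", "u1", "u2"]

after = apply_swaps(children, swaps)
union_before = children_union(children, chain)
union_after = children_union(after, chain)
```

After the step, `u2` is a child of `u0` (the even-order nodes form a path), and the union of the children sets over the chain is unchanged even though each individual set differs.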

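Lemma 21: If $u \notin A_t$, then $S^{*\prime}_u = S^*_u$. 

Proof: By induction on $|S^{*\prime}_u|$. If $|S^{*\prime}_u| = 1$, then $u$ is a 
leaf in $C'$, thus also in $C$, and $S^{*\prime}_u = S^*_u = \{u\}$. Assume that 
the statement holds for all nodes whose subtree contains at 
most $k$ nodes in $C'$, where $k \ge 1$. Let $u \notin A_t$ be a node for 
which $|S^{*\prime}_u| = k + 1$. 

If $u \notin A_s$, then $S'_u = S_u$. Moreover, for all $z \in S'_u$, 
$|S^{*\prime}_z| < |S^{*\prime}_u|$ and $z \notin A_t$, since their father was not the source of 
a swap. Therefore, the inductive hypothesis applies to each 
node $z \in S'_u$ and we get: 

$S^{*\prime}_u = \{u\} \cup \bigcup_{z \in S'_u} S^{*\prime}_z = \{u\} \cup \bigcup_{z \in S_u} S^*_z = S^*_u$ . 

If $u \in A_s$, then $u$ must be the origin of a swap chain. Let $V_{ch}$ 
be the node set of that swap chain and let 
$S_{ch} = (\bigcup_{v \in V_{ch}} S_v) \setminus V_{ch}$ and $S'_{ch} = (\bigcup_{v \in V_{ch}} S'_v) \setminus V_{ch}$. 
By Lemma 20, $S'_{ch} = S_{ch}$. 
Moreover, each node $z \in S'_{ch}$ is in the subtree of $u$ in $C'$ and 
$z \notin A_t$. Therefore, the inductive hypothesis applies to each 
node $z \in S'_{ch}$ and we get: 

$S^{*\prime}_u = V_{ch} \cup \bigcup_{z \in S'_{ch}} S^{*\prime}_z = V_{ch} \cup \bigcup_{z \in S_{ch}} S^*_z = S^*_u$ . 

■ 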

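Lemma 22: If $B' = B$, then $V(T'_b) = V(T_b)$, for all $b \in B$. 
Proof: Note that, for each $b \in B'$, we can write $V(T'_b)$ 
as follows: 

$V(T'_b) = S^{*\prime}_b \setminus \bigcup_{u \in B' \cap S^{*\prime}_b \setminus \{b\}} S^{*\prime}_u$ . 

Since $b \in B' = B$, we must have $b \notin A_t$. Therefore, by 
Lemma 21, each $b \in B'$ satisfies $S^{*\prime}_b = S^*_b$. We get, then, that 

$V(T'_b) = S^*_b \setminus \bigcup_{u \in B \cap S^*_b \setminus \{b\}} S^*_u = V(T_b)$ . 

■ 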

By Lemma 22, in a sequence of steps in which no bad 
node is activated or becomes non-bad, each of the connected 
components behaves in the same way as a tree that does not 
contain bad nodes.^1 Therefore, by Lemmas 5 and 6, each 
component will stabilize in finite time and the bad nodes will 
be the only candidates for activation. Lemma 8 follows. 

E. Proof of Lemma 9 

In this section, we use some notation introduced in Appendices 
B and D. Additionally, in this section we will use the 
following notation. Let $\mathbf{b}$ and $\mathbf{b}'$ denote the badness vectors 
corresponding to $C$ and $C'$. 

Lemma 23: If at least one bad node becomes non-bad 
without being activated in step $P$, then $|B'| < |B|$. 

Proof: Let $N$ be the set of bad nodes that become non-bad 
without being activated. We can partition the set $B$ as 
follows: 

$B = N \cup (A_b \setminus B') \cup (B \cap B')$ . 

Moreover, we can naturally partition the set $B'$ as follows: 

$B' = (B' \setminus B) \cup (B \cap B')$ . 

If $|N| > 0$, then from the first equation we get 
$|B| > |A_b \setminus B'| + |B \cap B'|$, and then from the second equation 
and Lemma 16 we have that 
$|B'| = |B' \setminus B| + |B \cap B'| \le |A_b \setminus B'| + |B \cap B'| < |B|$. ■ 

Lemma 24: If there exist nodes $u$ and $v$ such that 
$v \in S'_u \cap A_b$ and $u \in B \setminus A$ or $u \notin B'$, then $|B'| < |B|$. 

Proof: We will prove the statement by demonstrating an 
injection from $B'$ to $B \setminus \{v\}$. Consider the function $f : B' \to B$ 
where, given $x \in B'$, $f(x)$ is defined as follows: 

• If $x \in B \setminus A_b$, then $f(x) = x$. 

• Otherwise, $f(x) = y$ where $y$ is any child of $x$ in $C'$ such 
that $y \in A_b$. 

We need to prove that the function $f$ is well-defined and 
injective. For injectivity, it suffices to show that if $x \notin B \setminus A_b$, 
then $f(x) \notin B \setminus A_b$. Indeed, given an $x \in B'$ such that 
$x \notin B \setminus A_b$, we distinguish two cases: If $x \notin B$, then, by 
Lemmas 13, 14, and 15, $x$ has a child $y \in A_b$ in $C'$. On the 
other hand, if $x \in A_b$, then by Lemma 12 we know that $x$ 
is not a leaf of the component of $\mathcal{G}(A_b)$ to which it belongs 
and thus it has a child $y \in A_b$ in $C'$. Clearly, in both cases, 
$y \notin B \setminus A_b$. 

^1 That is slightly inaccurate: Certain nodes of the component may contain in 
their children set some bad nodes, which are at the root of other components. 
From the point of view of the father's component, these bad nodes will behave 
as if they are leaves whose height value is fixed to some arbitrary value, 
smaller than the height value of their father. However, it should be clear that 
this does not change the fact that the component will stabilize after a finite 
number of steps. 
It remains to show that no $x \in B'$ is mapped to $v$. The only 
candidate nodes that could be mapped to $v$ are $v$ itself and $u$, 
the father of $v$ in $C'$. We know that $v \in A_b$, which implies 
that $v \notin B \setminus A_b$, and thus $f(v) \ne v$. As for $u$, we have two 
cases: If $u \notin B'$, then $u$ is not even in the domain of $f$. If 
$u \in B'$, then, by assumption, we must have $u \in B \setminus A$ and 
thus $f(u) = u \ne v$. ■ 

For the proof of Lemma 9, we can assume that $\mathbf{b}'$ and $\mathbf{b}$ are 
of equal length, otherwise the statement holds by Lemma 17. 
Moreover, we can assume that $A_b \ne \emptyset$, otherwise the statement 
holds by Lemma 23. Since $r \notin A_b$, there exists a bad node 
in $C$ whose corresponding component contains the parent of a 
node in $A_b$. Let $b_j$, $j \ge 1$, be the first such bad node in the 
ordering of Definition 7. 

By definition of $b_j$, we have that $b_1, \ldots, b_j \notin A_b$. Therefore, 
by Lemma 23, $b_1, \ldots, b_j \in B'$. By Lemmas 13, 14, and 15, 
a non-bad node can be turned into a bad node only if, in $C'$, 
it is the father of a node in $A_b$. This implies that $b_1, \ldots, b_j$ 
are the first $j$ bad nodes in $C'$ according to the ordering of 
Definition 7. 

By definition of $b_j$ and Lemma 23, we also know that any 
child of any node in any of the components $T_{b_1}, \ldots, T_{b_{j-1}}$ that 
was in $B$, is also in $B'$. Therefore, for each $i < j$, $|T'_{b_i}| = |T_{b_i}|$. 
Finally, by Lemma 24, we can assume that $b_j$ itself is not the 
father of any node $v \in A_b$ and that, for any such node $v$ 
whose father $w$ was in $T_{b_j}$, at least $w$ no longer belongs to 
$T'_{b_j}$. Therefore, $|T'_{b_j}| < |T_{b_j}|$. 

F. Proof of Theorem 10 

By Lemma 9, the badness vector decreases lexicographically 
whenever at least one bad node is activated or becomes 
non-bad. By Lemma 8, we cannot have an infinite sequence 
of steps in which no bad node is activated or becomes non-bad. 
Moreover, during such a sequence of steps, the set of 
bad nodes remains the same by Lemmas 13, 14, and 15, and 
thus the badness vector remains the same by Lemma 22. The 
theorem follows from these observations and the easy fact that 
no configuration can have a corresponding badness vector that 
is lexicographically smaller than the single-component badness 
vector $(n + 1)$, where $n$ is the number of nodes in the system. 